[Serve] Throttle serve_deployment_replica_healthy gauge recording on the controller #60823
abrarsheikh wants to merge 2 commits into master
Signed-off-by: abrar <abrar@anyscale.com>
Code Review
This pull request introduces a performance optimization for the Serve controller by throttling the recording of the serve_deployment_replica_healthy metric. This is achieved by adding a time-based cache to DeploymentState, which avoids redundant Gauge.set() calls for replicas with unchanged health status. The implementation is clean, and the new logic is well-tested with unit tests covering various scenarios like cache hits, TTL expiration, and cache cleanup. My only suggestion is to use time.monotonic() instead of time.time() for measuring time intervals to make the caching logic robust against system clock changes.
    every control-loop iteration while still refreshing the metric often
    enough for Prometheus export.
    """
    now = time.time()
time.time() is not guaranteed to be monotonic. If the system clock is adjusted backwards, now - cached[1] could be negative, which could lead to the gauge not being updated for a long time, even past the intended interval. It's recommended to use time.monotonic() for measuring time durations to avoid this issue.
-    now = time.time()
+    now = time.monotonic()
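To make the concern concrete, here is a minimal sketch of the age check with hand-picked timestamps. The helper name `should_record` and all of the numbers are assumptions for illustration, not code from this PR.

```python
def should_record(cached_ts: float, now: float, interval_s: float) -> bool:
    """Return True when the cached entry is at least interval_s old."""
    return now - cached_ts >= interval_s


# Wall clock (time.time() style): the entry was written at t=1000, then the
# system clock was stepped back by one hour. Twenty real seconds later the
# wall clock reads 1000 - 3600 + 20, so the computed age is negative.
wall_clock_result = should_record(1000.0, 1000.0 - 3600.0 + 20.0, 10.0)

# Monotonic clock (time.monotonic() style): the step does not affect it, so
# twenty real seconds are measured as twenty seconds.
monotonic_result = should_record(500.0, 520.0, 10.0)

print(wall_clock_result, monotonic_result)
```

With the wall-clock timestamps the entry keeps looking fresh until the clock catches up again, so an unchanged value would not be re-recorded for roughly an hour instead of 10 seconds.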
Why
Profiling the Serve controller under load shows Gauge.set() for the serve_deployment_replica_healthy metric consuming ~5.9% of CPU. The call is made for every running replica on every control-loop iteration, even when the value hasn't changed. At 128+ replicas this is O(num_replicas) Cython FFI calls per loop, which is pure waste in steady state.

What
Add a time-based cache (_health_gauge_cache) to DeploymentState that tracks the last-reported (value, timestamp) per replica. _set_health_gauge() skips the Gauge.set() call when the value is unchanged and the entry is younger than RAY_SERVE_HEALTH_GAUGE_REPORT_INTERVAL_S (default 10s). The gauge is still re-recorded periodically so it remains visible across OpenCensus export windows, and is always recorded immediately on state transitions (healthy ↔ unhealthy). Cache entries are cleaned up when replicas are fully stopped.

The interval is configurable via the RAY_SERVE_HEALTH_GAUGE_REPORT_INTERVAL_S env var and set to 0.1 in CI to stay within the fast test metrics export cadence.

Test plan
- test_health_gauge_caching: verifies cache hits, TTL expiry re-records, value-change bypass, and cleanup on replica stop
- test_replica_metrics_fields integration test passes (metric still appears in Prometheus)

Related to #60680
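For readers who want the shape of the throttling logic without opening the diff, here is a rough standalone sketch. It is not the code in this PR: `ThrottledHealthGauge`, `FakeGauge`, `set_health`, `forget`, and the `replica` tag key are hypothetical names, and reading the env var at import time is an assumption. Only the env var name, its 10s default, and the (value, timestamp) cache layout come from the description above. The sketch uses time.monotonic(), as suggested in review.

```python
import os
import time
from typing import Callable, Dict, List, Tuple

# Env var name and 10 s default are from the PR description; parsing it here
# at import time is an assumption of this sketch.
REPORT_INTERVAL_S = float(
    os.environ.get("RAY_SERVE_HEALTH_GAUGE_REPORT_INTERVAL_S", "10")
)


class FakeGauge:
    """Hypothetical stand-in for a metrics gauge; it only logs set() calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple[int, str]] = []

    def set(self, value: int, tags: Dict[str, str]) -> None:
        self.calls.append((value, tags["replica"]))


class ThrottledHealthGauge:
    """Hypothetical wrapper that skips redundant gauge writes per replica."""

    def __init__(
        self,
        gauge: FakeGauge,
        interval_s: float = REPORT_INTERVAL_S,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._gauge = gauge
        self._interval_s = interval_s
        self._clock = clock
        # replica_id -> (last reported value, timestamp of that report)
        self._cache: Dict[str, Tuple[int, float]] = {}

    def set_health(self, replica_id: str, value: int) -> None:
        now = self._clock()
        cached = self._cache.get(replica_id)
        unchanged = cached is not None and cached[0] == value
        if unchanged and now - cached[1] < self._interval_s:
            return  # same value, still fresh: skip the gauge write
        self._gauge.set(value, tags={"replica": replica_id})
        self._cache[replica_id] = (value, now)

    def forget(self, replica_id: str) -> None:
        """Drop the cache entry once a replica is fully stopped."""
        self._cache.pop(replica_id, None)


# Usage with an injected fake clock so the example is deterministic.
fake_now = [0.0]
gauge = FakeGauge()
throttled = ThrottledHealthGauge(gauge, interval_s=10.0, clock=lambda: fake_now[0])
throttled.set_health("r1", 1)  # first report: recorded
throttled.set_health("r1", 1)  # unchanged and fresh: skipped
throttled.set_health("r1", 0)  # state transition: recorded immediately
fake_now[0] = 11.0
throttled.set_health("r1", 0)  # entry older than the interval: re-recorded
throttled.forget("r1")         # replica stopped: cache entry removed
```

Injecting the clock keeps the TTL path testable without sleeping, which is one way to cover the same cases the test plan lists (cache hit, TTL expiry, value change, cleanup).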